Optimise queries search for a chain of OR strings #3250
First, sort the conditions by column id. This allows us to pass through the array of conditions only once, comparing column indices before making dynamic casts. Early out if the column has a search index.
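The sort-then-single-pass idea described above could be sketched roughly as follows. This is only an illustration: `Condition` and `group_by_column` are hypothetical names, plain `std::string` stands in for the real condition nodes, and the search-index flag is assumed to be the same for every condition on a given column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a string-equality condition; not the realm-core type.
struct Condition {
    std::size_t column;    // index of the column the condition applies to
    std::string value;     // string compared for equality
    bool has_search_index; // assumed identical for all conditions on one column
};

// Sort by column index, then walk the array once, collecting each run of
// conditions that target the same non-indexed column.
std::vector<std::vector<std::string>> group_by_column(std::vector<Condition> conditions)
{
    std::stable_sort(conditions.begin(), conditions.end(),
                     [](const Condition& a, const Condition& b) { return a.column < b.column; });

    std::vector<std::vector<std::string>> groups;
    auto it = conditions.begin();
    while (it != conditions.end()) {
        // Early out: indexed columns are left to the search index.
        if (it->has_search_index) {
            ++it;
            continue;
        }
        std::vector<std::string> group{it->value};
        auto next = it + 1;
        // After sorting, conditions on the same column are adjacent.
        while (next != conditions.end() && next->column == it->column) {
            group.push_back(next->value);
            ++next;
        }
        groups.push_back(std::move(group));
        it = next;
    }
    return groups;
}
```

Because the conditions are sorted first, each one is visited exactly once, and the cheap column comparison decides whether anything more expensive needs to happen.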
Nice work, but see comment.
src/realm/query_engine.hpp
auto it = m_conditions.begin();
while (it != m_conditions.end()) {
    // Only try to optimize on StringNode<Equal> conditions without search index
    if (bool(*it) && (first = dynamic_cast<StringNode<Equal>*>(it->get())) && !first->has_search_index()) {
Slightly confused by this bool(*it) ... if it is necessary here, then how come we don't need it before dereferencing "next" in line 1691?
Perhaps some simplification is possible?
Thanks, I removed the check on bool(*it); it should be covered by the dynamic cast anyway.
Here are the actual benchmark results for reference (100000 rows, 1000 query conditions):
Before: (benchmark output not preserved)
After: (benchmark output not preserved)
What's the difference like with two conditions? I'd expect this to be slower for a sufficiently small number of conditions, but whether or not that's anything worth caring about depends on how much slower it is and where the break-even point is.
It's a good point: there is an overhead cost to computing a string hash. The majority of the time is spent computing our custom StringData hash (I'm not sure how performant it tries to be). That being the case, I added a simple loop that iterates through the conditions and checks for matches without hashing anything. Based on the following tests, I found that the threshold for choosing this over the hash is around 20 conditions. (The axis units are milliseconds vs. number of conditions.)
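A minimal sketch of that size-based choice, under stated assumptions: `NeedleMatcher` and `kHashThreshold` are hypothetical names, `std::string` and `std::hash` stand in for StringData and its custom hash, and the cutoff of 20 is taken from the measurement mentioned above, not from the merged code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Assumed cutoff, based on the "around 20 conditions" figure in this thread.
constexpr std::size_t kHashThreshold = 20;

// Hypothetical matcher that picks a lookup strategy from the number of conditions.
class NeedleMatcher {
public:
    explicit NeedleMatcher(std::vector<std::string> needles)
        : m_needles(std::move(needles))
    {
        // Only pay for hashing when there are enough conditions to amortise it.
        if (m_needles.size() >= kHashThreshold)
            m_set.insert(m_needles.begin(), m_needles.end());
    }

    bool uses_hash() const { return !m_set.empty(); }

    bool matches(const std::string& value) const
    {
        if (uses_hash())
            return m_set.count(value) != 0;
        // Few conditions: a plain loop avoids computing any hash.
        return std::find(m_needles.begin(), m_needles.end(), value) != m_needles.end();
    }

private:
    std::vector<std::string> m_needles;
    std::unordered_set<std::string> m_set;
};
```

Both paths return the same answer for any input; only the cost profile differs, so the cutoff can be tuned from benchmarks without affecting correctness.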
This is a performance enhancement motivated by users who are generating queries with many string comparisons on a single column, for example from Cocoa's "IN" queries; see https://github.com/realm/engineering/issues/22
The idea is to combine string equality conditions from a single "OR" query node and store them in an unordered_set. With N elements to search and C conditions, the runtime changes from O(N*C) to O(N). The added benchmark goes from 30 seconds to 2 seconds. This change does not try to optimise indexed columns, which should be running in O(log(N)*C). The benchmark with indexes turned on runs in 3.5 seconds. Since N is likely the dominant term, using indexes should still be fastest in practice when compared to this optimisation.
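To illustrate the complexity change, here is a rough sketch comparing the two scans. It assumes plain `std::string` values in a `std::vector` in place of a real column, and `find_naive` / `find_hashed` are hypothetical helper names, not part of the query engine.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// O(N*C): every row is compared against each condition in turn.
std::vector<std::size_t> find_naive(const std::vector<std::string>& column,
                                    const std::vector<std::string>& needles)
{
    std::vector<std::size_t> rows;
    for (std::size_t row = 0; row < column.size(); ++row) {
        for (const std::string& needle : needles) {
            if (column[row] == needle) {
                rows.push_back(row);
                break; // one match is enough for an OR
            }
        }
    }
    return rows;
}

// Expected O(N) once the set is built: one hash lookup per row.
std::vector<std::size_t> find_hashed(const std::vector<std::string>& column,
                                     const std::vector<std::string>& needles)
{
    const std::unordered_set<std::string> set(needles.begin(), needles.end());
    std::vector<std::size_t> rows;
    for (std::size_t row = 0; row < column.size(); ++row) {
        if (set.count(column[row]) != 0)
            rows.push_back(row);
    }
    return rows;
}
```

Building the set costs O(C) up front, which is why the hashed scan only pays off once there are enough conditions; for large N the per-row lookup dominates either way.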